A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations

Ariel Goldstein, Haocheng Wang, Leonard Niekerken, Mariano Schain, Zaid Zada, Bobbi Aubrey, Tom Sheffer, Samuel A. Nastase, Harshvardhan Gazula, Aditi Singh, Aditi Rao, Gina Choe, Catherine Kim, Werner Doyle, Daniel Friedman, Sasha Devore, Patricia Dugan, Avinatan Hassidim, Michael Brenner, Yossi Matias, Orrin Devinsky, Adeen Flinker, Uri Hasson
Nature Human Behaviour
Department of Psychology and the Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA

Table of Contents

Overall Summary

Study Background and Main Findings

This study investigated how the human brain processes natural language during real-world conversations. Researchers recorded brain activity using electrocorticography (ECoG), a technique that involves placing electrodes directly on the brain's surface, while participants engaged in unscripted conversations with family, friends, and hospital staff. This approach provided a rich dataset of approximately 100 hours of continuous recordings, encompassing nearly half a million words. The key innovation was the use of a state-of-the-art, multimodal speech-to-text model called Whisper (developed by OpenAI) to analyze both the audio recordings and the corresponding brain activity. Whisper is a deep learning model, meaning it's a complex algorithm that learns patterns from data, similar to how a brain learns. It's trained to process speech and convert it into text, and it does so by extracting different levels of linguistic information, from the raw sounds (acoustic features) to the recognized speech sounds (speech features) and finally to the meaning of the words (language features). These different levels of information are represented within the model as "embeddings," which are essentially numerical codes that capture different aspects of the language.

The researchers then used a technique called "encoding models" to see how well these embeddings from Whisper could predict the brain activity they recorded. They found that the embeddings could predict brain activity with remarkable accuracy. Moreover, different types of embeddings were better at predicting activity in different brain regions. Speech embeddings, representing the sounds of speech, were more strongly related to activity in areas involved in hearing and producing speech, such as the superior temporal cortex and the precentral gyrus. Language embeddings, representing the meaning of words, were more strongly related to activity in higher-level language areas, such as the inferior frontal gyrus and the angular gyrus. This pattern aligns with the well-established understanding of how language is processed in the brain, with a hierarchical organization from lower-level sensory and motor areas to higher-level cognitive areas.
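To make the encoding-model idea concrete, here is a minimal sketch in Python/NumPy. It fits a regularized linear regression from word embeddings to one electrode's activity and scores the fit by correlating predicted with recorded activity. The ridge regularization, function names, and synthetic data are illustrative assumptions for this sketch, not the authors' actual pipeline.

```python
import numpy as np

def fit_encoding_model(embeddings, neural, alpha=1.0):
    """Ridge regression mapping word embeddings to one electrode's activity.

    embeddings: (n_words, n_dims) matrix of (dimensionality-reduced) embeddings.
    neural:     (n_words,) neural signal aligned to each word's onset.
    Returns the learned weight vector of shape (n_dims,).
    """
    d = embeddings.shape[1]
    # Closed-form ridge solution: (X'X + alpha*I)^-1 X'y
    return np.linalg.solve(embeddings.T @ embeddings + alpha * np.eye(d),
                           embeddings.T @ neural)

def encoding_score(weights, embeddings, neural):
    """Pearson correlation between predicted and recorded activity."""
    pred = embeddings @ weights
    return np.corrcoef(pred, neural)[0, 1]

# Toy demonstration on synthetic data (50-dimensional embeddings).
rng = np.random.default_rng(0)
true_w = rng.standard_normal(50)
X_train = rng.standard_normal((400, 50))
y_train = X_train @ true_w + 0.5 * rng.standard_normal(400)
X_test = rng.standard_normal((100, 50))
y_test = X_test @ true_w + 0.5 * rng.standard_normal(100)

w = fit_encoding_model(X_train, y_train)
r = encoding_score(w, X_test, y_test)
```

In the study, this kind of model is fit separately per electrode and per embedding type, and the held-out correlation is what the brain maps in Figs. 2 and 3 visualize.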

Furthermore, the study found that the Whisper model outperformed traditional linguistic models, which rely on symbolic representations of language (like parts of speech and grammatical rules). This suggests that deep learning models, which learn statistical patterns from vast amounts of data, may capture aspects of language processing that are not captured by traditional, rule-based approaches. The study also examined the timing of brain activity and found that the model could capture fine-grained temporal patterns during both speech production and comprehension. For example, during speech production, there was evidence of brain activity related to the upcoming word even before the word was spoken, suggesting that the brain plans the entire word in advance. During speech comprehension, the brain activity showed a sequential pattern, with earlier parts of the speech signal being processed earlier in the brain.

The study concludes that unified computational models, like Whisper, offer a promising new framework for studying the neural basis of natural language processing. These models can capture the entire processing hierarchy, from acoustics to meaning, and provide a more comprehensive and naturalistic view of how the brain processes language in real-world situations.

Research Impact and Future Directions

This study provides compelling evidence for a strong correspondence between a unified computational model of language (OpenAI's Whisper) and the neural activity observed in the human brain during natural conversations. The researchers demonstrate that different levels of linguistic representation within the model – acoustic, speech, and language – map onto distinct brain regions, mirroring the known hierarchical organization of language processing in the cortex. The model's ability to predict neural activity with high accuracy, even outperforming traditional symbolic models, suggests that deep learning approaches offer a promising avenue for understanding the complex neural mechanisms underlying language.

The work makes a significant contribution by moving beyond highly controlled experimental settings and investigating language processing in real-world, unconstrained conversations. This ecological approach, combined with the advanced computational modeling, provides a more naturalistic and comprehensive view of how the brain processes language. However, it's crucial to acknowledge that the study's findings are based on correlations between model representations and brain activity. While these correlations are strong and statistically significant, they do not definitively prove that the brain uses the same representations or computational principles as the model. Further research is needed to explore the causal relationships and to determine the extent to which these findings generalize to the broader population, given the study's small sample size of patients with epilepsy.

Despite these limitations, the study represents a significant step forward in bridging the gap between computational linguistics and neuroscience. The findings open up exciting avenues for future research, including investigating the temporal dynamics of language processing in more detail, exploring the role of individual differences, and developing more refined computational models that can capture even finer-grained aspects of neural language processing. The potential applications of this research extend to clinical settings, where a better understanding of the neural basis of language could lead to improved diagnostic and therapeutic tools for language disorders.

Critical Analysis and Recommendations

Clear Summary of Key Findings (written-content)
The abstract effectively summarizes the key findings, highlighting the alignment between the model's processing hierarchy and the brain's cortical hierarchy for speech and language. This concise overview provides readers with a clear understanding of the study's main result, increasing accessibility and impact.
Section: Abstract
Missing Explicit Research Question (written-content)
The abstract does not explicitly state the research question at the beginning. Adding a clear statement of the research question (e.g., "This study investigates how the human brain processes natural language during everyday conversations...") would provide immediate context and improve the abstract's overall clarity and impact.
Section: Abstract
Critique of Traditional Approaches (written-content)
The introduction effectively establishes the limitations of traditional psycholinguistic approaches in capturing the complexities of real-world conversations. This critique sets the stage for the need for a new, unified computational framework, justifying the study's approach.
Section: Introduction
Missing Explicit Research Question (written-content)
The introduction lacks a concise statement of the specific research question being addressed. Adding a sentence like, "This study aims to investigate the neural mechanisms underlying natural language processing during real-world conversations..." would immediately orient the reader to the study's purpose.
Section: Introduction
Accurate Prediction of Neural Activity (written-content)
Whisper's embeddings accurately predicted neural activity during natural conversations (correlations ranging from 0.04 to 0.40, P < 0.01, FWER corrected, Fig. 2). This was demonstrated through encoding models that mapped the model's internal representations onto brain activity recorded via ECoG. This finding provides strong evidence for the alignment between the computational model and brain activity, suggesting that the model captures relevant aspects of neural language processing.
Section: Results
Hierarchical Organization of Encoding (written-content)
Speech embeddings better predicted activity in lower-level speech perception/production areas (superior temporal cortex, precentral gyrus), while language embeddings better predicted activity in higher-order language areas (inferior frontal gyrus, angular gyrus) (Fig. 3). This hierarchical organization was revealed through variance partitioning, quantifying the unique contribution of each embedding type. This finding supports the established understanding of a hierarchical organization of language processing in the brain, extending it to naturalistic conversational settings.
Section: Results
Missing Effect Sizes and Confidence Intervals (written-content)
The Results section lacks consistent reporting of effect sizes and confidence intervals alongside p-values. Including these measures (e.g., Cohen's d, Pearson's r, and their confidence intervals) would provide a more complete picture of the magnitude and reliability of the findings, allowing for a better assessment of practical significance.
Section: Results
Consideration of Different Interpretations (written-content)
The discussion thoughtfully considers different interpretations of the relationship between the model's internal representations and brain activity. It presents both a conservative view (the model learns the transformation between distinct codes) and a more speculative one (the model and brain share computational principles), offering a balanced perspective.
Section: Discussion
Missing Acknowledgment of Limitations (written-content)
The discussion does not explicitly acknowledge the study's limitations, such as the small sample size (N=4) and the specific patient population (individuals with epilepsy). Addressing these limitations would enhance the paper's credibility and provide a more nuanced interpretation of the results.
Section: Discussion
Comprehensive Preprocessing Pipeline (written-content)
The Methods section meticulously describes the preprocessing pipeline for both speech and ECoG recordings. This includes steps for de-identification, transcription, alignment, artifact mitigation, and signal processing, enhancing the reproducibility of the study.
Section: Methods
Incomplete Description of Manual Verification (written-content)
The Methods section does not fully detail the criteria used for manual verification and adjustment of word onset and offset times. Providing more detail on this manual correction process would enhance transparency and allow other researchers to replicate this crucial step.
Section: Methods

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Fig. 1 | An ecological, dense-sampling paradigm for modelling neural activity...
Full Caption

Fig. 1 | An ecological, dense-sampling paradigm for modelling neural activity during real-world conversations.

Figure/Table Image (Page 2)
Fig. 1 | An ecological, dense-sampling paradigm for modelling neural activity during real-world conversations.
First Reference in Text
The Whisper architecture incorporates a multilayer encoder network and a multilayer decoder network (Fig. 1): the encoder maps continuous acoustic inputs into a high-dimensional embedding space, capturing speech features which are transferred into a word-level decoder, effectively mapping them into contextual word embeddings (refs. 21–23).
Description
  • Overview of the experimental and computational approach: This figure outlines the overall experimental and computational approach used in the study. It shows how the researchers recorded brain activity (using electrocorticography, or ECoG, which is like a more detailed EEG that involves placing electrodes directly on the brain's surface) while people were having natural conversations. They simultaneously recorded the audio of these conversations. The audio and transcriptions of the conversations were then fed into a powerful computer model called "Whisper," which is a type of deep learning model. Deep learning models are complex algorithms that learn patterns from data, much like how a brain learns. Whisper is specifically designed to process speech. The figure shows that the researchers extracted different types of information, called "embeddings," from Whisper. These embeddings represent different aspects of the speech, from low-level acoustic features (the raw sounds) to higher-level linguistic information (the meaning of the words). They then used a mathematical technique called linear regression to see how well these embeddings could predict the brain activity they recorded. Linear regression, in simple terms, is like finding the best-fitting line through a set of data points, allowing you to predict one variable based on another.
  • Dense-sampling paradigm: The figure highlights a "dense-sampling paradigm." This refers to the continuous and extensive recording of neural activity (24/7) during real-life conversations. It contrasts with traditional experiments that often use short, controlled stimuli. This approach aims to capture the natural complexity of language processing in a more realistic setting. The diagram shows the timeline of conversations ('How are you today?' and 'I feel better...'), indicating periods of speech production (purple) and comprehension (green).
  • Types of embeddings extracted from the Whisper model: The figure shows three key types of "embeddings" extracted from the Whisper model: acoustic embeddings, speech embeddings, and language embeddings. Acoustic embeddings represent the raw auditory input to the model. Speech embeddings are taken from the final layer of Whisper's "encoder," which transforms the acoustic input into a representation of speech sounds. Language embeddings are taken from the "decoder," which converts the speech representation into a representation of the meaning of the words. The figure shows these as different layers, reflecting the hierarchical processing within the Whisper model. It also notes that each embedding type was reduced to 50 dimensions using Principal Component Analysis (PCA), a technique that reduces the number of variables while retaining most of the original information, like summarizing a large dataset with a smaller set of key features.
  • Linear regression analysis: The figure visually represents the linear regression analysis. This analysis attempts to find a mathematical relationship between the embeddings (acoustic, speech, or language) and the recorded brain activity. It's depicted with equations showing how the embeddings (X) are multiplied by weights (β) to predict neural activity. The 'Beta weights' represent the strength of the relationship between each embedding dimension and the brain activity. The goal is to see how well the model's internal representations (the embeddings) can predict real brain activity during natural conversations. The figure includes a schematic of brain coverage, showing the locations of the electrodes in the four participants (S1-S4).
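The PCA step described above can be sketched as follows. This is a generic SVD-based implementation with illustrative data dimensions; it is not the authors' code, and the returned variance fraction is exactly the detail the Scientific Validity note below asks the paper to report.

```python
import numpy as np

def pca_reduce(embeddings, k=50):
    """Project embeddings onto their top-k principal components.

    embeddings: (n_words, n_dims) matrix; k: target dimensionality.
    Returns (reduced, explained), where `explained` is the fraction of
    total variance retained by the k components.
    """
    X = embeddings - embeddings.mean(axis=0)       # centre each dimension
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    reduced = X @ Vt[:k].T                         # scores on top-k PCs
    explained = (S[:k] ** 2).sum() / (S ** 2).sum()
    return reduced, explained

# Illustrative: 300 words with 384-dimensional embeddings reduced to 50.
rng = np.random.default_rng(1)
emb = rng.standard_normal((300, 384))
reduced, frac = pca_reduce(emb)
```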
Scientific Validity
  • Overall methodological approach: The figure presents a valid and innovative approach to studying neural activity during natural conversations. The use of a dense-sampling paradigm and a powerful speech-to-text model (Whisper) is a significant strength. The application of linear regression to relate model embeddings to brain activity is a standard and appropriate method for this type of analysis.
  • Dimensionality reduction using PCA: The use of PCA for dimensionality reduction is justified, given the high dimensionality of the embeddings. However, it would be beneficial to provide more detail about the PCA procedure, such as the amount of variance explained by the 50 components.
  • Extraction of embeddings: The figure clearly outlines the process of extracting embeddings from different layers of the Whisper model. This is crucial for understanding the hierarchical nature of the analysis and the comparison of acoustic, speech, and language representations.
  • Visualization of Brain Coverage: The depiction of brain coverage is helpful, but a more detailed visualization, perhaps showing individual electrode locations, would be beneficial. It's also important to note that the coverage is limited to the left hemisphere, which should be explicitly stated in the figure legend.
Communication
  • Clarity and organization of the visual representation: The figure effectively introduces the core components of the study's methodology, offering a clear visual representation of the data collection and analysis pipeline. The use of distinct colors and labels for different stages (Comprehension, Production, and different embedding types) enhances readability. However, the figure is quite complex and could benefit from a more streamlined layout to improve immediate comprehension, perhaps by separating the production and comprehension pipelines more distinctly.
  • Completeness of the figure legend: The figure legend is concise but could be expanded to provide a more detailed explanation of each component, especially the 'Encoder stack' and 'Decoder stack'. While the main text elaborates on these, a self-contained explanation within the figure caption would improve stand-alone understanding.

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Fig. 2 | Acoustic, speech and language encoding model performance during speech...
Full Caption

Fig. 2 | Acoustic, speech and language encoding model performance during speech production and comprehension.

Figure/Table Image (Page 4)
Fig. 2 | Acoustic, speech and language encoding model performance during speech production and comprehension.
First Reference in Text
Whisper's acoustic, speech and language embeddings predicted neural activity with remarkable accuracy across conversations comprising hundreds of thousands of words during both speech production and comprehension for numerous electrodes in various regions of the cortical language network (Fig. 2).
Description
  • Overview of encoding model performance: This figure presents the results of how well the different types of information extracted from the "Whisper" model (acoustic, speech, and language) can predict brain activity during both speaking (production) and listening (comprehension). The researchers used a technique called "encoding models" to do this. Think of an encoding model as a way to translate between the language of the computer model (the embeddings) and the language of the brain (the neural activity). The better the translation, the better the model is at capturing what's happening in the brain.
  • Color-coded brain maps representing correlation (r): The figure shows brain maps, color-coded to represent the strength of the prediction (correlation, represented by 'r'). A correlation is a number between -1 and 1 that indicates how well two things are related. A correlation of 0 means no relationship, while 1 (or -1) means a perfect positive (or negative) relationship. Here, the colors represent the correlation between the predicted brain activity (based on the Whisper model's embeddings) and the actual recorded brain activity. The colors range from 0.04 (light yellow) to 0.40 (dark red), indicating varying degrees of positive correlation. The N values (N=64, N=274, etc.) indicate the number of electrodes included in each map.
  • Separate panels for production, comprehension, and embedding types: There are separate brain maps for speech production (when people were talking) and speech comprehension (when people were listening). Within each of these, there are maps for the acoustic embeddings (representing the raw sound), speech embeddings (representing the recognized speech sounds), and language embeddings (representing the meaning of the words). This allows us to see which type of information from the Whisper model best predicts brain activity in different brain areas during different tasks.
  • Statistical Significance: The figure shows results that are statistically significant. The statement 'P < 0.01, FWER' means that the probability of observing these results by chance is less than 1%, and this has been corrected for multiple comparisons using the Family-Wise Error Rate (FWER) method. FWER correction is a way to reduce the chances of getting false positives when you're doing many statistical tests at once (in this case, testing many electrodes).
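One common way to obtain an FWER-controlled significance threshold across many electrodes is the max-statistic permutation approach, sketched below. Whether the authors used exactly this variant is an assumption of this sketch; the data and parameters are illustrative.

```python
import numpy as np

def fwer_threshold(pred, actual, n_perm=200, alpha=0.01, seed=0):
    """Family-wise error threshold for per-electrode correlations.

    pred, actual: (n_electrodes, n_words) predicted and recorded activity.
    On each permutation the word order is shuffled (breaking the model-brain
    alignment) and the maximum correlation across electrodes is recorded;
    the (1 - alpha) quantile of these maxima is the significance threshold.
    """
    rng = np.random.default_rng(seed)
    n_elec, n_words = actual.shape
    max_r = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = actual[:, rng.permutation(n_words)]
        rs = [np.corrcoef(pred[e], shuffled[e])[0, 1] for e in range(n_elec)]
        max_r[i] = max(rs)
    return np.quantile(max_r, 1 - alpha)

# Toy data with a weak true model-brain relationship at every electrode.
rng = np.random.default_rng(2)
pred = rng.standard_normal((5, 200))
actual = pred + 2.0 * rng.standard_normal((5, 200))
thr = fwer_threshold(pred, actual)
observed = [np.corrcoef(pred[e], actual[e])[0, 1] for e in range(5)]
```

Taking the maximum over electrodes on each permutation is what controls the family-wise error rate: an electrode is significant only if its observed correlation exceeds what the strongest electrode achieves by chance.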
Scientific Validity
  • Overall methodological approach: The figure presents compelling evidence for the alignment between the Whisper model's internal representations and neural activity during natural language processing. The use of a large dataset (hundreds of thousands of words) and multiple electrodes strengthens the generalizability of the findings.
  • Statistical significance: The use of a rigorous statistical threshold (P < 0.01, FWER corrected) provides confidence that the observed correlations are not due to chance.
  • Comparison of different embedding types: The presentation of results for different embedding types (acoustic, speech, and language) allows for a nuanced understanding of how different levels of linguistic information are encoded in the brain.
  • Correlation vs. Causation: While the figure shows impressive results, it's important to acknowledge that correlation does not equal causation. The observed correlations suggest an alignment between the model and the brain, but they do not prove that the brain uses the same representations as the model. Further investigation is needed to explore the causal relationship.
Communication
  • Clarity and organization of the visual representation: The figure effectively visualizes the encoding performance for three different embedding types (acoustic, speech, and language) across multiple brain regions. The use of color-coded brain maps allows for a quick comparison of performance across conditions (production and comprehension). However, the figure could benefit from a clearer indication of the scale for the correlation values (r). While the range is stated (0.04 - 0.40), adding tick marks or labels on the color bar would improve readability.
  • Completeness of the figure legend: The figure legend is concise but could be more informative. For example, explicitly stating that 'N' refers to the number of electrodes would be helpful. Also, clarifying the meaning of the 'P < 0.01, FWER' threshold in the legend would enhance stand-alone understanding.
  • Panel organization and layout: The use of separate panels for production and comprehension, and for each embedding type, makes it easy to compare the results across these different conditions. The layout is logical and well-organized.
Fig. 3 | Mixed selectivity for speech and language embeddings during speech...
Full Caption

Fig. 3 | Mixed selectivity for speech and language embeddings during speech production and comprehension.

Figure/Table Image (Page 5)
Fig. 3 | Mixed selectivity for speech and language embeddings during speech production and comprehension.
First Reference in Text
We observed different selectivity patterns for speech and language embeddings, each accounting for different portions of the variance across different cortical areas (Fig. 3).
Description
  • Overall concept of mixed selectivity: This figure shows how well different types of information from the Whisper model – specifically, speech sounds (speech embeddings) and the meaning of words (language embeddings) – predict brain activity in different parts of the brain, and how this changes depending on whether someone is talking (production) or listening (comprehension). The main idea is to see which parts of the brain are more sensitive to the sounds of speech versus the meaning of the words.
  • Color-coded brain maps and unique variance explained: The figure uses color-coded brain maps. The color at each location on the brain represents which type of information (speech or language) is better at predicting brain activity in that area. Red means speech sounds are more important, blue means the meaning of the words is more important, and white means it's a mix of both. The colors show the percentage of "unique variance explained." Variance, in this context, is a measure of how much the brain activity changes over time. 'Unique variance explained' means how much of that change can be predicted by only one type of information (either speech or language), after taking into account any overlap between them.
  • Separate maps for production and comprehension: There are separate brain maps for when people are talking (speech production) and when they are listening (speech comprehension). This allows us to see if the patterns of brain activity are different for these two processes.
  • Individual electrode plots showing temporal dynamics: In addition to the brain maps, there are smaller graphs showing the correlation between predicted and actual brain activity over time (the x-axis is labeled 'Lag (s)', meaning time in seconds). These graphs are for specific electrodes in specific brain regions (like IFG, STG, etc.). The red line shows the correlation for speech embeddings (sounds), and the blue line shows the correlation for language embeddings (meaning). These graphs show how the relationship between the model's information and brain activity changes over a short period of time around when a word is spoken or heard.
  • Statistical threshold and FDR correction: The dotted horizontal line in each of the smaller graphs represents the statistical threshold. This means that any correlation above that line is considered statistically significant, meaning it's unlikely to have happened by chance. The text mentions that the threshold is q < 0.01, two-sided, FDR corrected. This means the probability of a false positive is less than 1%, and this has been adjusted for multiple comparisons using the False Discovery Rate (FDR) method.
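The "unique variance explained" logic described above can be sketched with a simple variance-partitioning scheme: fit a joint model with both feature sets, then measure how much R² drops when each set is removed. This is a minimal OLS-based sketch with synthetic data, not the authors' implementation.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def unique_variance(speech, language, neural):
    """Each feature set's unique contribution = drop in R^2 when removed."""
    joint = r_squared(np.hstack([speech, language]), neural)
    u_speech = joint - r_squared(language, neural)    # speech-only share
    u_language = joint - r_squared(speech, neural)    # language-only share
    return u_speech, u_language

# Synthetic electrode driven mostly by the "speech" features.
rng = np.random.default_rng(3)
speech = rng.standard_normal((500, 10))
language = rng.standard_normal((500, 10))
neural = (speech @ rng.standard_normal(10)
          + 0.1 * (language @ rng.standard_normal(10))
          + 0.5 * rng.standard_normal(500))
u_s, u_l = unique_variance(speech, language, neural)
```

In Fig. 3's terms, this toy electrode would be colored red (speech-selective): its unique speech variance dwarfs its unique language variance, even though both contribute.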
Scientific Validity
  • Overall methodological approach: The figure presents a novel and insightful analysis of the differential roles of speech and language representations in the brain. The use of variance partitioning to quantify the unique contribution of each embedding type is a strong methodological approach.
  • Comparison of production and comprehension: The inclusion of both production and comprehension data allows for a comparison of the neural substrates involved in these two fundamental aspects of language processing.
  • Presentation of group and individual data: The presentation of results at both the group level (brain maps) and the individual electrode level (plots) provides a comprehensive view of the data.
  • Statistical analysis: The statistical analysis appears to be rigorous, with appropriate correction for multiple comparisons (FDR correction).
  • Interpretation of selectivity: The figure focuses on selectivity, which is the relative importance of speech vs. language. It's important to note that even in areas showing strong selectivity for one type of information, the other type might still contribute to neural activity. The figure does not imply that these areas are exclusively involved in processing only one type of information.
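The FDR correction mentioned in the figure (q < 0.01) is typically implemented with the Benjamini-Hochberg procedure, sketched below; whether the authors used exactly this variant is an assumption.

```python
import numpy as np

def bh_fdr(pvals, q=0.01):
    """Benjamini-Hochberg FDR: boolean mask of significant tests.

    Sorts the p-values, compares each to its rank-scaled threshold
    q * rank / m, and rejects every test up to the largest rank that
    falls below its threshold.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True
    return reject

# Illustrative: two strong effects survive, two weak ones do not.
mask = bh_fdr([0.001, 0.5, 0.002, 0.9], q=0.05)
```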
Communication
  • Overall organization and clarity: The figure presents a complex set of results, comparing encoding performance for speech and language embeddings during both production and comprehension across various brain regions. The use of separate brain maps for production and comprehension, colored according to the proportion of unique variance explained, is effective for visualizing the spatial distribution of selectivity. The inclusion of individual electrode plots with correlation over time adds another layer of detail. However, the sheer amount of information presented makes the figure somewhat overwhelming. The small size of the individual plots and the lack of clear visual separation between the production and comprehension sections make it challenging to quickly grasp the key findings.
  • Color scheme and representation of mixed selectivity: The color scheme used to represent the percentage of unique variance (ranging from red for speech to blue for language) is intuitive, but the addition of a 'mixed' category (white) adds complexity. It might be helpful to provide a more explicit explanation of what constitutes 'mixed' selectivity in the figure legend.
  • Readability of individual electrode plots: The individual electrode plots are useful for showing the temporal dynamics of encoding performance, but the x-axis labels ('Lag (s)') are small and could be more prominent. Adding tick marks or grid lines to the plots might also improve readability.
  • Use of abbreviations: The use of abbreviations for brain regions (e.g., STG, IFG, preCG) is standard practice, but including a key or expanding these abbreviations in the figure legend would make the figure more accessible to a broader audience.
Fig. 4 | Enhanced encoding for language embeddings fused with auditory speech...
Full Caption

Fig. 4 | Enhanced encoding for language embeddings fused with auditory speech features.

Figure/Table Image (Page 6)
Fig. 4 | Enhanced encoding for language embeddings fused with auditory speech features.
First Reference in Text
In testing both sets of embeddings, we observed that encoding performance for language embeddings was significantly higher when the language decoder received speech information from the encoder, during both production (Fig. 4a) and comprehension (Fig. 4b).
Description
  • Comparison of language embeddings with and without auditory input: This figure compares how well two different types of language information from the Whisper model predict brain activity. The first type ('Only text') is based solely on the written words of the conversation. The second type ('Text + audio') combines the written words with the actual sounds of the speech. The researchers are testing whether adding the sound information improves the prediction of brain activity.
  • Separate panels for production and comprehension: The figure shows results for both when people are talking (production, panel a) and when they are listening (comprehension, panel b). This allows us to see if the effect of adding sound information is different for these two processes.
  • Brain maps showing the difference in correlation: The brain maps show the difference in prediction accuracy between the two types of language information. The colors represent the 'Δ correlation,' which is the difference in correlation values between the 'Text + audio' model and the 'Only text' model. Warmer colors (closer to 0.050) mean the 'Text + audio' model is better, while cooler colors (closer to -0.050) mean the 'Only text' model is better. The 'N' values indicate the number of electrodes.
  • Line graphs showing correlation over time: The line graphs show the correlation values over time (the x-axis is 'Lag (s)', meaning time in seconds) for all electrodes ('All') and for electrodes in the inferior frontal gyrus ('IFG'). The blue line represents the 'Only text' model, and the pink line represents the 'Text + audio' model. These graphs show how the relationship between the model's information and brain activity changes over a short period of time around when a word is spoken or heard.
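The lag analysis behind these line graphs can be sketched as a lagged correlation: slide the neural signal relative to the model's prediction and correlate at each offset. The function below is a generic illustration with synthetic data (a 3-sample delay), not the authors' code.

```python
import numpy as np

def lagged_correlation(pred, neural, lags):
    """Correlation between prediction and neural signal at each lag.

    pred, neural: 1-D arrays aligned at lag 0.
    lags: offsets in samples; positive lag means the neural signal
    follows the prediction (neural[t + lag] is paired with pred[t]).
    """
    rs = []
    for lag in lags:
        if lag >= 0:
            a, b = pred[:len(pred) - lag], neural[lag:]
        else:
            a, b = pred[-lag:], neural[:lag]
        rs.append(np.corrcoef(a, b)[0, 1])
    return np.array(rs)

# Synthetic neural signal that trails the prediction by 3 samples.
rng = np.random.default_rng(4)
word_signal = rng.standard_normal(1000)
delay = 3
neural = np.concatenate([np.zeros(delay), word_signal[:-delay]])
neural += 0.3 * rng.standard_normal(1000)
lags = np.arange(-5, 6)
rs = lagged_correlation(word_signal, neural, lags)
peak = lags[np.argmax(rs)]
```

Plotting `rs` against `lags` yields exactly the kind of curve shown in the figure's electrode panels, with the peak lag revealing when that electrode's activity best matches the model.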
Scientific Validity
  • Overall methodological approach: The figure provides strong evidence that incorporating auditory speech features improves the encoding performance of language embeddings. This supports the idea that the brain integrates acoustic and linguistic information during both speech production and comprehension.
  • Comparison of production and comprehension: The use of separate analyses for production and comprehension allows for a comparison of the effects of auditory information in these two processes.
  • Presentation of results at different levels: The presentation of results at both the group level (brain maps) and for specific regions (IFG) provides a more detailed view of the data. However, providing similar plots for other regions (like STG) in supplementary materials could further strengthen the findings.
Communication
  • Overall organization and clarity: The figure is divided into two main sections (a and b), clearly separating the results for production and comprehension. The use of brain maps and line graphs effectively visualizes the comparison between language embeddings with and without auditory input. However, the brain maps are relatively small, and the color difference between 'Only text' and 'Text + audio' is subtle, making it somewhat difficult to distinguish between them. The line graphs, while informative, could benefit from more prominent axis labels and tick marks.
  • Caption descriptiveness: The caption is concise but could be more descriptive. It would be helpful to explicitly state what 'enhanced encoding' means in this context (i.e., higher correlation values).
  • Consistency and clarity of notation: The use of 'N' to represent the number of electrodes is consistent with previous figures, but a reminder in the legend would still be beneficial for stand-alone understanding.
Fig. 5 | Comparing speech and language embeddings to symbolic features.
Figure/Table Image (Page 7)
Fig. 5 | Comparing speech and language embeddings to symbolic features.
First Reference in Text
Our findings indicate that speech and language embeddings extracted from the multimodal, deep acoustic-to-speech-to-language model outperform symbolic speech and language features (Fig. 5) in predicting neural activity during natural conversations.
Description
  • Comparison of embeddings and symbolic features: This figure compares two different ways of representing speech and language in the Whisper model: embeddings and symbolic features. Embeddings are the internal representations learned by the deep learning model, while symbolic features are traditional linguistic features like phonemes (speech sounds) and parts of speech (nouns, verbs, etc.). The figure shows how well each of these representations predicts brain activity during speech production and comprehension.
  • Panel (a): Speech embeddings vs. symbolic speech features: Panel (a) focuses on speech. It compares how well the speech embeddings (from the Whisper encoder) and symbolic speech features (like phonemes and articulation features) predict brain activity. The line graphs show the correlation between predicted and actual brain activity over time. The red line represents deep speech embeddings, and the orange line represents symbolic speech features.
  • Panel (b): Language embeddings vs. symbolic language features: Panel (b) focuses on language. It compares how well the language embeddings (from the Whisper decoder) and symbolic language features (like parts of speech and syntactic dependencies) predict brain activity. The blue line represents deep language embeddings, and the light blue line represents symbolic language features.
  • Brain maps showing unique variance explained: The brain maps show the percentage of unique variance explained by each type of representation (deep vs. symbolic). This means how much of the change in brain activity can be predicted by only that type of representation, after taking into account any overlap between them. The color coding for % unique variance explained in the brain maps indicates the relative importance of the representations.
  • Statistical significance: The dotted horizontal line in the line graphs represents a statistical threshold. Correlations above this line are considered statistically significant. The text indicates that red dots (in panel a) and blue dots (panel b) indicate a statistically significant difference in performance between the deep embeddings and the symbolic features.
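The variance partitioning described above can be sketched as nested regressions: the unique contribution of one feature set is the joint model's R² minus the R² of the other feature set alone. A minimal ordinary-least-squares sketch follows (illustrative only; the paper's pipeline uses cross-validated ridge regression, so details differ):

```python
import numpy as np

# Minimal variance-partitioning sketch via nested OLS regressions
# (illustrative only; the paper uses cross-validated ridge regression).

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def unique_variance(X_deep, X_symbolic, y):
    """Variance in y explained uniquely by each feature set:
    unique(A) = R^2(A and B jointly) - R^2(B alone)."""
    joint = r_squared(np.column_stack([X_deep, X_symbolic]), y)
    return {
        "deep": joint - r_squared(X_symbolic, y),
        "symbolic": joint - r_squared(X_deep, y),
    }
```

In the figure's terms, the brain maps color each electrode by these unique-variance values for deep embeddings versus symbolic features.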
Scientific Validity
  • Overall methodological approach: The figure provides strong evidence that deep embeddings outperform symbolic features in predicting neural activity during natural conversations. This supports the idea that deep learning models capture aspects of language processing that are not captured by traditional linguistic features.
  • Comparison of different conditions and representations: The use of separate analyses for production and comprehension, and for speech and language, allows for a detailed comparison of the different representations.
  • Analysis of different brain regions: The inclusion of results for different brain regions (All, STG, IFG) provides insights into the spatial distribution of encoding performance.
  • Statistical Analysis: The statistical analysis, including the use of FDR correction, appears to be appropriate.
  • Variance partitioning: The variance partitioning analysis is a strong method for quantifying the unique contribution of each representation.
Communication
  • Overall organization and clarity: The figure is divided into two main sections (a and b), clearly separating the comparison for speech and language embeddings. Within each section, results are presented for both production and comprehension, and for different brain regions (All, STG, IFG). The use of line graphs to show correlation over time is effective, and the color-coding (deep vs. symbolic) is consistent. However, the brain images showing unique variance explained are quite small and lack detailed anatomical labels, making it difficult to precisely identify the regions where differences are observed.
  • Caption descriptiveness: The caption is concise but could be more informative. It would be helpful to explicitly state what 'symbolic features' are being compared to the embeddings.
  • Readability of line graphs: The x-axis labels ('Lag (s)') on the line graphs are small and could be more prominent. Adding tick marks or grid lines might also improve readability.
  • Color scheme in brain maps: The color scheme used in the brain maps to represent % unique variance explained could be confusing: it uses different colors for deep versus symbolic features, where one might expect a single-color gradient. Readers must rely on the description within the figure to interpret the maps correctly.
Fig. 6 | Representations of phonetic and lexical information in Whisper.
Figure/Table Image (Page 8)
Fig. 6 | Representations of phonetic and lexical information in Whisper.
First Reference in Text
a–d, Speech embeddings and language embeddings were visualized in a two-dimensional space using t-SNE (Fig. 6a–d).
Description
  • Use of t-SNE for visualization: This figure uses a technique called t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize the high-dimensional information from the Whisper model. t-SNE is a way to take data that has many dimensions (like the embeddings, which have hundreds or thousands of dimensions) and reduce it down to just two dimensions so we can plot it on a graph. The goal of t-SNE is to keep similar data points close together and dissimilar data points far apart in the 2D representation.
  • Separate panels for speech and language embeddings: There are four subpanels (a-d). Panels (a) and (c) show the speech embeddings (information about the sounds of speech), while panels (b) and (d) show the language embeddings (information about the meaning of words).
  • Color-coding by phonetic and lexical categories: Within each pair of panels, one is colored by phonetic categories (like the specific sounds in a word, panel a) and the other is colored by lexical categories (like the part of speech, such as noun, verb, adjective, panel d). This allows us to see if the embeddings naturally cluster together based on these features.
  • Interpretation of point clustering: Each point in the plots represents a single word (or a short segment of audio for the speech embeddings). The closer two points are, the more similar their embeddings are in the high-dimensional space. If points of the same color (meaning the same phonetic or lexical category) tend to cluster together, it suggests that the embeddings are capturing that type of information.
  • Classification accuracy: Panel (e) shows how well a computer algorithm can classify (or guess) the correct phonetic or lexical category of a word based on its embedding. It does this for both speech and language embeddings, and for different layers of the Whisper model. Higher accuracy means the embedding contains more information about that category.
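Panel (e)'s classification analysis (predicting a phonetic or lexical category from an embedding) can be illustrated with any simple classifier. Below is a minimal nearest-centroid sketch on toy vectors; the paper's actual classifier and feature preparation are not reproduced here:

```python
from collections import defaultdict

# Toy stand-in for panel (e): classify a category from an embedding
# with a nearest-centroid rule. This only illustrates the logic of
# "how much category information does an embedding carry"; the
# paper's actual classifier may differ.

def nearest_centroid_accuracy(train_pairs, test_pairs):
    """train_pairs/test_pairs: lists of (embedding_vector, label)."""
    sums, counts = {}, defaultdict(int)
    for vec, label in train_pairs:
        if label not in sums:
            sums[label] = list(vec)
        else:
            sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] += 1
    centroids = {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    correct = sum(
        min(centroids, key=lambda lab: sq_dist(vec, centroids[lab])) == label
        for vec, label in test_pairs
    )
    return correct / len(test_pairs)
```

If embeddings of the same category cluster tightly (as the t-SNE panels suggest), this accuracy will be well above chance, which is the quantitative claim panel (e) makes.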
Scientific Validity
  • Use of t-SNE: The use of t-SNE is a valid approach for visualizing high-dimensional data like embeddings. However, it's important to remember that t-SNE is a non-linear dimensionality reduction technique, which means it can distort distances and relationships between data points. It's primarily useful for exploring clustering patterns, not for making precise measurements of distances.
  • Comparison of different embeddings and categories: The comparison of speech and language embeddings, and of phonetic and lexical categories, provides a comprehensive view of the information captured by the different representations.
  • Quantitative analysis using classification: The presentation of classification accuracy (panel e) provides a quantitative measure of the information content of the embeddings, complementing the qualitative visualization provided by the t-SNE plots.
  • Linking representations to neural processes: The figure investigates whether phonetic and lexical categories are represented in the embeddings, but it doesn't directly address how the brain uses this information. It's an important step in understanding the model's internal representations, but further research is needed to link these representations to neural processes.
Communication
  • Overall organization and clarity: The figure is divided into four subpanels (a-d), each showing a t-SNE visualization of either speech or language embeddings, colored by either phonetic or lexical categories. This organization allows for a clear comparison of the clustering patterns. However, the subpanels are relatively small, and the points within each plot are densely packed, making it difficult to discern individual data points and their relationships. The use of different colors for different categories is helpful, but a legend explaining the color-coding is essential and should be more prominently displayed.
  • Caption descriptiveness: The caption is concise, but it could be more specific. It would be helpful to mention that t-SNE is used for visualization and to briefly explain the purpose of the figure (i.e., to examine the clustering of embeddings based on phonetic and lexical features).
  • Missing axis labels: The axis labels are missing in the t-SNE plots, which is a significant omission. While t-SNE axes are not directly interpretable in the same way as traditional coordinate axes, it's still important to indicate that these are t-SNE dimensions (e.g., 't-SNE 1', 't-SNE 2').
  • Missing chance level in classification accuracy: Panel (e) presents classification accuracy of phonetic and lexical categories. The different colors representing different layers is helpful, but it would aid the understanding of readers if horizontal lines representing the chance level were added to the graphs.
Fig. 7 | Temporal dynamics of speech production and speech comprehension across...
Full Caption

Fig. 7 | Temporal dynamics of speech production and speech comprehension across different brain areas.

Figure/Table Image (Page 9)
Fig. 7 | Temporal dynamics of speech production and speech comprehension across different brain areas.
First Reference in Text
Evaluating encoding models at each lag relative to word onset allows us to trace the temporal flow of information from STG (speech comprehension ROI) to IFG (language-related ROI) to SM (speech production ROI) during the production and comprehension of natural conversations (Fig. 7).
Description
  • Overall concept of temporal dynamics: This figure shows how the relationship between the Whisper model's predictions and actual brain activity changes over time, during both speaking (production) and listening (comprehension). The researchers are looking at the timing of brain activity in different areas to see which areas are active first and how the activity flows between them.
  • Line graphs showing correlation over time: The line graphs show the correlation between predicted and actual brain activity over time (the x-axis is 'Lag (s)', meaning time in seconds relative to when a word is spoken or heard). A correlation measures how well two things are related. Here, it shows how well the model's predictions match the brain activity at different points in time.
  • Separate panels for production and comprehension, and different brain regions: There are separate graphs for speech production (panel a) and speech comprehension (panel b). Within each panel, there are graphs for different brain regions: IFG (inferior frontal gyrus, involved in language), SM (sensorimotor cortex, involved in movement and sensation), and STG (superior temporal gyrus, involved in hearing).
  • Color-coding of lines: The different colored lines in each graph represent different conditions or brain regions. In panels (a) and (b), the blue line represents IFG, the red line represents SM, and the orange line represents STG. In panel (c), the colors represent different parts of the SM: dSM (dorsal), mSM (middle), and vSM (ventral).
  • Statistical significance: The dotted horizontal line in each graph, where present, represents a statistical threshold; correlations above this line are considered statistically significant. The text indicates that different significance thresholds are used (*P < 0.05, **P < 0.01, ***P < 0.001), corrected for multiple comparisons.
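The multiple-comparisons correction referred to above is commonly implemented as Benjamini-Hochberg FDR control across electrodes. A generic sketch follows (the paper's exact correction procedure may differ in detail):

```python
# Generic Benjamini-Hochberg FDR correction across many tests
# (e.g., one p-value per electrode). Illustrative sketch; the
# paper's exact correction procedure may differ.

def fdr_bh(pvals, q=0.05):
    """Return a boolean list: True where the null is rejected at FDR q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/m) * q ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    # ... then reject every test ranked at or below k.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected
```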
Scientific Validity
  • Overall methodological approach: The figure presents a novel and insightful analysis of the temporal dynamics of speech processing. By examining encoding performance at different time lags, the researchers can infer the order in which different brain regions are involved in production and comprehension.
  • Comparison of production and comprehension: The use of separate analyses for production and comprehension allows for a direct comparison of the temporal dynamics in these two processes.
  • Analysis of different brain regions: The inclusion of data from multiple brain regions (IFG, SM, STG) provides a more comprehensive view of the network involved in speech processing.
  • Statistical analysis: The statistical analysis, as described in the text, appears to be appropriate, with the use of t-tests and correction for multiple comparisons.
Communication
  • Overall organization and clarity: The figure is divided into two main sections (a and b), clearly separating the results for production and comprehension. Within each section, line graphs show the correlation between predicted and actual neural activity over time for different brain regions (IFG, SM, STG). The use of different colors for different regions and conditions is effective. However, the figure is quite complex, and the small size of the individual plots makes it challenging to see the details. The brain map (panel d) is helpful for visualizing the electrode locations, but it lacks detailed anatomical labels.
  • Caption descriptiveness: The caption is informative but could be more specific. It would be helpful to mention the specific types of embeddings (speech and language) being analyzed and the key finding (the temporal order of activation).
  • Readability of line graphs: The x-axis labels ('Lag (s)') on the line graphs are small and could be more prominent. Adding tick marks or grid lines might also improve readability.
  • Use of abbreviations: The use of abbreviations (IFG, SM, STG) is standard practice, but expanding these abbreviations in the figure legend or providing a key would make the figure more accessible.
  • Missing y-axis label in Panel c: Panel (c) uses different colors for dSM, mSM, and vSM which helps to see the differences in the temporal dynamics in these regions. However, the y-axis is not labeled, making it difficult to interpret the plot.
Fig. 8 | Fine-grained temporal sequence of speech encoding during production...
Full Caption

Fig. 8 | Fine-grained temporal sequence of speech encoding during production and comprehension.

Figure/Table Image (Page 10)
Fig. 8 | Fine-grained temporal sequence of speech encoding during production and comprehension.
First Reference in Text
We observed that during speech comprehension, neural encoding begins to peak around word onset and gradually shifts over time (Fig. 8b,d).
Description
  • Overall concept of fine-grained temporal sequence: This figure shows how the relationship between the Whisper model's predictions and brain activity changes over very short periods of time (milliseconds) during both speaking (production) and listening (comprehension). It focuses on the timing of when the model's predictions are most strongly related to brain activity.
  • Encoder units (20-ms chunks): The figure uses the Whisper encoder, which processes speech in 20-millisecond chunks, called 'encoder units'. The researchers are looking at how well each of these 20-ms chunks predicts brain activity at different points in time.
  • Line graphs of encoding performance: Panels (a) and (b) show line graphs of encoding performance over time. The x-axis represents the 'Encoder unit' (1 to 20), and the y-axis represents the correlation between the model's predictions and brain activity. Different colored lines represent different brain areas or conditions.
  • Scatter plots with regression lines: Panels (c) and (d) show scatter plots with regression lines. The x-axis represents the 'Encoder unit' (1 to 20), and the y-axis represents the 'Lag (s)', which is the time delay between the word onset and the peak encoding performance. The regression lines show the overall trend, and the statistics (β, P) indicate the slope and significance of the relationship.
  • Shifting encoding peak during comprehension: The key finding for comprehension (panels b and d) is that the encoding peak shifts over time. This means that earlier encoder units (representing earlier parts of the speech signal) predict brain activity earlier, and later encoder units predict brain activity later. This suggests a sequential processing of speech information.
  • Fixed delay of encoding peak during production: The key finding for production (panels a and c) is different. Before word onset, the encoding peak remains at a fixed delay (around -300 ms). This suggests that the brain has information about the entire upcoming word before it's actually spoken.
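The contrast between a shifting peak (comprehension) and a fixed delay (production) can be quantified as in panels (c) and (d): find the lag of peak correlation for each encoder unit, then regress peak lag on unit index. A pure-Python sketch with hypothetical numbers (the paper fits linear mixed models):

```python
# Sketch of the peak-lag regression in panels (c) and (d).
# Hypothetical numbers; the paper fits linear mixed models.

def peak_lag(corr_by_lag):
    """corr_by_lag: dict mapping lag in seconds -> correlation;
    return the lag with the maximal correlation."""
    return max(corr_by_lag, key=corr_by_lag.get)

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

A positive slope of peak lag over encoder units corresponds to the comprehension pattern (later 20-ms chunks peak later), while a slope near zero corresponds to the production pattern (a fixed pre-onset delay across units).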
Scientific Validity
  • Overall methodological approach: The figure presents a novel and insightful analysis of the fine-grained temporal dynamics of speech encoding. By examining encoding performance at the level of individual encoder units (20-ms chunks), the researchers can make inferences about the timing of neural processes.
  • Comparison of production and comprehension: The use of separate analyses for production and comprehension allows for a direct comparison of the temporal dynamics in these two processes.
  • Regression analysis: The use of regression analysis to quantify the relationship between encoder unit and peak encoding lag is a strong methodological approach.
  • Statistical analysis: The statistical analysis, including the use of linear mixed models and reporting of p-values and confidence intervals, appears to be appropriate.
Communication
  • Overall organization and clarity: The figure is complex, presenting multiple subpanels (a-d) that compare encoding performance across different encoder units and conditions (production and comprehension). The use of line graphs and scatter plots with regression lines is appropriate for visualizing the temporal dynamics. However, the figure is quite dense, and the small size of the individual plots makes it challenging to discern details. The color-coding and symbols are generally clear, but a more explicit explanation of the 'Encoder unit' in the legend would be beneficial.
  • Caption descriptiveness: The caption is informative but could be more specific. It would be helpful to mention the key finding (the shift in encoding peak during comprehension and the fixed delay during production).
  • Readability of graphs: The x-axis and y-axis labels on the line graphs and scatter plots are small and could be more prominent. Adding tick marks or grid lines might also improve readability.
  • Panel organization: The use of separate panels for production (a, c) and comprehension (b, d) is effective for comparing the results across these conditions.
  • Incomplete reference to subpanels: The reference text mentions Fig. 8b,d, but Fig. 8a,c are also relevant and present important information. It would be more appropriate to reference the entire figure (Fig. 8) or all subpanels (Fig. 8a-d) in this sentence.
Supp. Figure 2. Encoding performance for reduced sample size.
Figure/Table Image (Page 21)
Supp. Figure 2. Encoding performance for reduced sample size.
First Reference in Text
Moreover, prediction performance in the left-out testing segments was robust and did not meaningfully change even when we used only 25% of the data for training (Supplementary Fig. 2).
Description
  • Overall concept of reduced sample size analysis: This figure shows what happens to the accuracy of the brain activity predictions when the researchers use less data to train their models. They compare the results when using all the available data (x-axis) to the results when using only 25% of the data (y-axis). Each point in the scatter plots represents a single electrode.
  • Separate plots for speech and language embeddings: There are separate plots for speech embeddings (panels A and B) and language embeddings (panels C and D). This allows us to see if the effect of reducing the sample size is different for different types of information.
  • Separate plots for production and comprehension: There are also separate plots for speech production (when people are talking) and speech comprehension (when people are listening).
  • Interpretation of the diagonal line: The diagonal line in each plot represents the scenario where the prediction accuracy is the same with the full dataset and the reduced dataset. If a point falls on this line, it means reducing the sample size didn't change the accuracy for that electrode. Points above the line mean the accuracy was better with less data (which is unlikely), and points below the line mean the accuracy was worse with less data.
  • Key finding: robustness to reduced sample size: The main finding is that most of the points cluster closely around the diagonal line. This means that reducing the sample size to 25% didn't significantly change the prediction accuracy for most electrodes. This suggests that the models are robust and don't require a huge amount of data to achieve good performance.
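The robustness check can be emulated on synthetic data: fit the same encoding model on 100% and on 25% of the training set and compare held-out test correlations. A hypothetical linear-regression sketch (the paper uses ridge regression on Whisper embeddings):

```python
import numpy as np

# Synthetic emulation of the sample-size robustness check: fit the
# same linear encoding model on the full training set and on a 25%
# subset, then compare held-out correlations. Hypothetical data and
# plain OLS; the paper uses ridge regression on Whisper embeddings.

def heldout_correlation(X_train, y_train, X_test, y_test):
    X1 = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(X1, y_train, rcond=None)
    pred = np.column_stack([np.ones(len(X_test)), X_test]) @ beta
    return np.corrcoef(pred, y_test)[0, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))          # 2,000 "words" x 10 features
w = rng.normal(size=10)
y = X @ w + 0.1 * rng.normal(size=2000)  # simulated electrode signal
X_tr, y_tr, X_te, y_te = X[:1600], y[:1600], X[1600:], y[1600:]

r_full = heldout_correlation(X_tr, y_tr, X_te, y_te)
subset = rng.choice(1600, size=400, replace=False)  # 25% of training data
r_quarter = heldout_correlation(X_tr[subset], y_tr[subset], X_te, y_te)
```

When the model has far more training samples than parameters, as here, the two correlations land close together, which is the pattern of points hugging the diagonal in the scatter plots.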
Scientific Validity
  • Overall methodological approach: The figure provides strong evidence for the robustness of the encoding models to reduced sample size. This is an important finding because it suggests that the results are not overly sensitive to the amount of data used for training.
  • Comparison of different conditions: The use of separate analyses for speech and language embeddings, and for production and comprehension, allows for a comprehensive assessment of robustness across different conditions.
  • Quantitative measure of difference: The figure presents a clear and direct comparison of encoding performance with full and reduced datasets. However, it would be helpful to include a quantitative measure of the difference in performance, such as the mean difference in correlation values or the percentage of electrodes with significantly reduced performance.
Communication
  • Overall organization and clarity: The figure presents scatter plots comparing encoding performance (correlation values) using the full dataset versus a reduced dataset (25% of the data). Separate plots are shown for speech and language embeddings, and for production and comprehension. The use of scatter plots is appropriate for visualizing the relationship between two continuous variables. The diagonal line clearly indicates the expected performance if there were no change with reduced sample size. However, the points are densely packed, making it difficult to assess the distribution and density in different regions of the plots. Using a different plot type, such as a 2D histogram or a density plot, might improve clarity.
  • Caption descriptiveness: The caption is concise but could be more informative. It would be helpful to explicitly state the key finding (i.e., that encoding performance is robust even with reduced sample size).
  • Readability of scatter plots: The axis labels are clear and informative, but adding tick marks or grid lines might improve readability.
Supp. Figure 3. Comparing language embeddings across layers and models.
Figure/Table Image (Page 22)
Supp. Figure 3. Comparing language embeddings across layers and models.
First Reference in Text
We also extracted language embeddings from the decoder stack of layer 4 (instead of layer 3 which was used for Figs. 2-7) and a unimodal language model (GPT-2), and obtained similar encoding results (Supplementary Fig. 3).
Description
  • Comparison of language embeddings across layers and models: This figure explores whether the main findings of the paper (that language embeddings predict brain activity) depend on the specific way the language information is extracted. It compares the results using language embeddings from two different layers of the Whisper model (layer 3 and layer 4 of the decoder) and also from a completely different language model called GPT-2. GPT-2, like Whisper, is a powerful, deep-learning model trained on a massive amount of text, but it was trained only on text, unlike Whisper, which was trained on both text and audio.
  • Brain maps showing correlation values: The brain maps show the correlation between predicted and actual brain activity, similar to previous figures. The colors represent the strength of the correlation, with warmer colors indicating better predictions. Separate maps are shown for speech production (when people are talking) and speech comprehension (when people are listening).
  • Statistical significance: The 'N' values indicate the number of electrodes included in each map. The results are presented for statistically significant electrodes (p < 0.01, FWER corrected), meaning that the observed correlations are unlikely to have happened by chance.
  • Key finding: similar encoding results across layers and models: The main finding is that the encoding results are similar regardless of whether the language embeddings are taken from layer 3 or layer 4 of the Whisper decoder, and also when using embeddings from GPT-2. This suggests that the ability to predict brain activity from language embeddings is not specific to a particular layer or model, but rather reflects a more general property of how language is represented in these models.
Scientific Validity
  • Overall methodological approach: The figure provides important evidence for the robustness of the main findings. By showing that similar results are obtained with different language embeddings (from different layers and a different model), the researchers demonstrate that their findings are not an artifact of a specific methodological choice.
  • Comparison with a unimodal language model: The use of a unimodal language model (GPT-2) is a particularly strong control, as it shows that the results are not dependent on the multimodal nature of the Whisper model.
  • Comparison of production and comprehension: The presentation of results for both production and comprehension allows for a comparison of the robustness across these two processes.
Communication
  • Overall organization and clarity: The figure presents brain maps showing encoding performance (correlation values) for language embeddings extracted from different layers of the Whisper decoder (layer 4 vs. layer 3) and from a different language model (GPT-2). Separate maps are shown for production and comprehension. The use of brain maps allows for a quick visual comparison of performance across conditions. However, the color scale lacks an explicit color bar, although the range (0.04-0.4) is stated in the figure.
  • Caption descriptiveness: The caption is informative but could be more specific. It would be helpful to state the key finding (i.e., that similar encoding results are obtained across different layers and models).
  • Lack of detailed anatomical labels: The brain maps lack detailed anatomical labels, making it difficult to precisely identify the regions showing significant encoding performance.
Supp. Figure 4. Continuous acoustic and speech encoding model performance...
Full Caption

Supp. Figure 4. Continuous acoustic and speech encoding model performance during speech production and comprehension.

Figure/Table Image (Page 22)
Supp. Figure 4. Continuous acoustic and speech encoding model performance during speech production and comprehension.
First Reference in Text
Because the speech encoder receives continuous speech recordings, we could also run encoding models for continuous acoustic and speech embeddings, encompassing all time points in each recording, including non-speech segments, irrespective of the spoken word boundaries (Supplementary Fig. 4a,b and Methods).
Description
  • Continuous encoding analysis: This figure extends the analysis beyond individual words and looks at how well the Whisper model's representations predict brain activity continuously over time, even during periods without speech. This contrasts with the previous figures, which focused on brain activity aligned to specific words.
  • Comparison of continuous acoustic and speech embeddings: The figure compares two types of information from the Whisper model: continuous acoustic embeddings (representing the raw sound) and continuous speech embeddings (representing the recognized speech sounds, but still over the entire time course, not just at word boundaries).
  • Brain maps of encoding performance: Panels (A) and (B) show brain maps of encoding performance for the continuous acoustic and speech embeddings, respectively. The colors represent the correlation between predicted and actual brain activity. Separate maps are shown for speech production (when people are talking) and speech comprehension (when people are listening).
  • Difference in encoding performance: Panel (C) shows the difference in encoding performance between the continuous speech and acoustic embeddings. Red indicates areas where speech embeddings perform better, and blue indicates areas where acoustic embeddings perform better.
  • Line graphs of correlation over time: Panel (D) shows line graphs of the correlation over time for all electrodes. The red line represents the continuous speech embeddings, and the blue line represents the continuous acoustic embeddings.
  • Key finding: speech embeddings outperform acoustic embeddings: The key finding is that even when considering continuous signals (including non-speech segments), speech embeddings outperform acoustic embeddings in predicting brain activity in most brain areas. This strengthens the evidence that the speech representations in the Whisper model capture important aspects of neural processing.
Scientific Validity
  • Overall methodological approach: The figure presents a valuable extension of the main analysis, demonstrating that the superiority of speech embeddings is not limited to word-aligned activity but also holds for continuous signals. This addresses a potential concern that the previous results might be an artifact of focusing only on word onsets and offsets.
  • Comparison of production and comprehension: The use of separate analyses for production and comprehension allows for a comparison of the continuous encoding performance across these two processes.
  • Comparison of acoustic and speech embeddings: The inclusion of results for both acoustic and speech embeddings provides a clear comparison of the different representations.
  • Statistical Analysis: The statistical analysis, as described in the Methods section and referenced in the text, appears to be appropriate. The contrast in panel C uses FDR correction.
Communication
  • Overall organization and clarity: The figure is divided into multiple panels (A-D), presenting brain maps and line graphs comparing encoding performance for continuous acoustic and speech embeddings. Separate maps and graphs are shown for production and comprehension. The use of brain maps allows for a quick visual comparison of performance across brain regions, and the line graphs provide a visualization of the temporal dynamics. However, the brain maps are relatively small and lack detailed anatomical labels. The color scales are not explicitly defined on the figure, although the range is mentioned for panel C.
  • Caption descriptiveness: The caption is informative but could be more specific. It would be helpful to explicitly state the key finding (that speech embeddings outperform acoustic embeddings even when considering continuous signals).
  • Readability of line graphs: The x-axis and y-axis labels on the line graphs (Panel D) are small and could be more prominent. Adding tick marks or grid lines might also improve readability.
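To make the encoding analysis concrete, the following is a minimal sketch of an encoding model: ridge regression from embeddings to one electrode's signal, scored by correlating held-out predictions with the recorded activity. The data, dimensions, and the `encoding_correlation` helper are all illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def encoding_correlation(embeddings, neural, n_folds=5, alpha=10.0):
    """Fit ridge regression from embeddings to one electrode's activity
    and return the mean held-out correlation across folds."""
    n = len(neural)
    folds = np.array_split(np.arange(n), n_folds)
    rs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        X_tr, X_te = embeddings[train_idx], embeddings[test_idx]
        y_tr, y_te = neural[train_idx], neural[test_idx]
        # Closed-form ridge solution: w = (X'X + alpha*I)^-1 X'y
        d = X_tr.shape[1]
        w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(d), X_tr.T @ y_tr)
        pred = X_te @ w
        rs.append(np.corrcoef(pred, y_te)[0, 1])
    return float(np.mean(rs))

rng = np.random.default_rng(0)
emb = rng.standard_normal((500, 20))          # hypothetical embedding matrix
signal = emb @ rng.standard_normal(20) + rng.standard_normal(500)  # synthetic electrode
print(round(encoding_correlation(emb, signal), 2))
```

With embeddings that genuinely drive the signal, held-out correlations approach the noise ceiling; with unrelated embeddings, they hover near zero, which is what the color scales on the brain maps summarize per electrode.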
Supp. Figure 5. Unique variance explained by acoustic, speech, and language...
Full Caption

Supp. Figure 5. Unique variance explained by acoustic, speech, and language embeddings.

Figure/Table Image (Page 23)
Supp. Figure 5. Unique variance explained by acoustic, speech, and language embeddings.
First Reference in Text
A similar analysis was also done for acoustic and speech embeddings (Supplementary Fig. 5).
Description
  • Comparison of acoustic and speech embeddings: This figure is similar to Figure 3, but instead of comparing speech and language embeddings, it compares acoustic and speech embeddings. It shows how much of the change in brain activity can be predicted by only the raw sound information (acoustic embeddings) versus only the recognized speech sounds (speech embeddings), after accounting for any overlap between them.
  • Color-coded brain maps and unique variance explained: The brain maps are color-coded to represent the percentage of unique variance explained. This means how much of the change in brain activity can be predicted by one type of information (acoustic or speech) but not the other.
  • Separate maps for production and comprehension: There are separate maps for speech production (when people are talking) and speech comprehension (when people are listening). This allows us to see if the patterns are different for these two processes.
  • Key finding: speech embeddings explain more unique variance: The key finding is that in most brain areas, speech embeddings (representing the recognized speech sounds) explain more unique variance than acoustic embeddings (representing the raw sound). This suggests that the brain is more sensitive to the higher-level speech features than to the raw acoustic details.
Scientific Validity
  • Overall methodological approach: The figure provides a valuable comparison of the unique contributions of acoustic and speech representations to neural activity. This analysis complements the main findings and helps to disentangle the roles of different levels of auditory processing.
  • Variance partitioning: The use of variance partitioning is a strong method for quantifying the unique contribution of each embedding type.
  • Comparison of production and comprehension: The presentation of results for both production and comprehension allows for a comparison of the effects across these two processes.
Communication
  • Overall organization and clarity: The figure presents brain maps showing the percentage of unique variance explained by acoustic and speech embeddings during production and comprehension. The use of color-coded brain maps is effective for visualizing the spatial distribution of variance explained. However, the color scale is not explicitly defined on the figure, and the brain maps lack detailed anatomical labels, making it difficult to precisely identify the regions showing significant differences. The color range is also very narrow, making differences difficult to interpret.
  • Caption descriptiveness: The caption is informative, but it could be more specific. It would be helpful to state the key finding (that speech embeddings explain more unique variance than acoustic embeddings in most regions).
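The variance-partitioning logic used in this figure can be sketched in a few lines: fit a joint model with both embedding sets, fit each reduced model alone, and take the unique variance of one set as the drop in R² when it is removed. A minimal numpy sketch on synthetic data; all variable names and the overlap structure are illustrative.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

def unique_variance(X_a, X_b, y):
    """Unique variance explained by X_a beyond X_b, and vice versa."""
    full = r_squared(np.hstack([X_a, X_b]), y)
    return full - r_squared(X_b, y), full - r_squared(X_a, y)

rng = np.random.default_rng(1)
acoustic = rng.standard_normal((400, 5))
# Speech features partly overlap with acoustic ones (shared first 2 columns)
speech = np.hstack([acoustic[:, :2], rng.standard_normal((400, 5))])
y = speech @ rng.standard_normal(7) + 0.5 * rng.standard_normal(400)
u_acoustic, u_speech = unique_variance(acoustic, speech, y)
print(round(u_acoustic, 3), round(u_speech, 3))
```

Because the synthetic signal is driven by the speech features, their unique variance dominates while the acoustic-only contribution is near zero, mirroring the pattern the figure reports for most electrodes.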
Supp. Figure 8. Comparing speech and languaging based encoding for...
Full Caption

Supp. Figure 8. Comparing speech and languaging based encoding for comprehension and production.

Figure/Table Image (Page 27)
Supp. Figure 8. Comparing speech and languaging based encoding for comprehension and production.
First Reference in Text
Supplementary Figs. 6 and 8 display the mean encoding results during production and comprehension in three ROIs (SM, IFG and STG) per patient.
Description
  • Comparison of speech and language embeddings for production and comprehension: This supplementary figure compares how well speech sounds (speech embeddings) and the meaning of words (language embeddings) predict brain activity in different brain areas, separately for when people are talking (production) and when they are listening (comprehension). It's similar to Figure 3, but this figure focuses on comparing production and comprehension directly.
  • Multiple brain regions analyzed: The figure shows results for several different brain regions, including preCG (precentral gyrus), postCG (postcentral gyrus), TP (temporal pole), STG (superior temporal gyrus), IFG (inferior frontal gyrus), pMTG (posterior middle temporal gyrus), and AG (angular gyrus). These regions are involved in different aspects of speech and language processing.
  • Line graphs showing correlation over time: Each small graph shows the correlation between predicted and actual brain activity over time (the x-axis is 'Lag (s)', meaning time in seconds relative to when a word is spoken or heard). The different colored lines represent different conditions (production or comprehension).
  • Region-specific results: The figure presents results for individual brain regions, allowing for a comparison of the temporal dynamics of encoding performance across different areas.
Scientific Validity
  • Overall methodological approach: The figure provides a valuable comparison of encoding performance for speech and language embeddings during both production and comprehension, across multiple brain regions. This allows for a more nuanced understanding of the neural substrates involved in these processes.
  • Comparison of production and comprehension: The use of separate analyses for production and comprehension is crucial for understanding the differences in neural dynamics between these two processes.
  • Analysis of multiple brain regions: The inclusion of data from multiple brain regions provides a more comprehensive view of the network involved in speech and language processing.
  • Averaged results: The reference text indicates that these are mean encoding results, suggesting that the data has been averaged across participants. While this provides a general overview, it's important to also consider individual variability, as shown in Supplementary Figure 6.
Communication
  • Overall organization and clarity: The figure presents a complex set of results, comparing encoding performance for speech and language embeddings during both production and comprehension. The use of separate panels for different brain regions (preCG, postCG, TP, STG, IFG, pMTG, AG) and for production/comprehension is effective for organizing the information. However, the figure is extremely dense, and the small size of the individual plots, combined with the lack of clear y-axis scales and tick marks, makes it very difficult to interpret the results. The color-coding (purple for production, green for comprehension) is consistent, but a more explicit legend would be beneficial.
  • Caption descriptiveness and terminology: The caption is informative but could be more specific. The term "languaging" is not standard terminology and should be replaced with "language". It would also be helpful to restate the definitions of the brain regions (preCG, postCG, etc.) in the legend.
  • Readability of plots: The x-axis ('Lag (s)') is consistent across plots, but the y-axis ('Correlation (r)') lacks a clear scale and tick marks, making it difficult to compare correlation values across plots and conditions.
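The 'Lag (s)' curves in these plots come from evaluating the word-aligned model at a series of temporal offsets relative to word onset. A hedged sketch of that lag loop, substituting a simple feature-to-signal correlation for the full regression pipeline; onsets, sampling rate, and the injected response latency are all synthetic assumptions.

```python
import numpy as np

def lag_curve(neural, onsets, feature, lags_ms, sr=1000):
    """Correlate a per-word feature with neural activity sampled at
    each lag (in ms) relative to word onset.  sr = samples per second."""
    rs = []
    for lag in lags_ms:
        idx = onsets + int(lag * sr / 1000)
        idx = np.clip(idx, 0, len(neural) - 1)
        rs.append(np.corrcoef(feature, neural[idx])[0, 1])
    return np.array(rs)

rng = np.random.default_rng(2)
onsets = np.arange(500, 59000, 600)            # hypothetical word onsets (samples)
feature = rng.standard_normal(len(onsets))     # one feature value per word
neural = rng.standard_normal(60000) * 0.3
neural[onsets + 200] += feature                # response peaking 200 ms post-onset
lags = range(-500, 501, 100)
curve = lag_curve(neural, onsets, feature, lags)
print(int(list(lags)[int(np.argmax(curve))]))
```

The peak of the curve recovers the injected 200 ms latency, which is the kind of temporal signature (pre-onset peaks for production, post-onset peaks for comprehension) these panels compare across regions.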
Supp. Figure 10. Evidence for speech processing of the speaker's own voice...
Full Caption

Supp. Figure 10. Evidence for speech processing of the speaker's own voice during speech production.

Figure/Table Image (Page 29)
Supp. Figure 10. Evidence for speech processing of the speaker's own voice during speech production.
First Reference in Text
In contrast, the second peak occurs ~200 ms after word onset. Additional analyses indicate that the first peak is associated with motor planning, while the second peak is associated with the speaker processing their own voice (Supplementary Fig. 10).
Description
  • Focus on post-word-onset activity during production: This supplementary figure investigates what happens in the brain when someone is talking, specifically focusing on the period shortly after they say a word. It builds on the observation that some brain areas show two peaks of activity related to speech: one before the word is spoken (related to planning the speech) and one after (potentially related to hearing your own voice).
  • Speech embeddings and brain regions: The figure shows how well the Whisper model's speech embeddings (representing the recognized speech sounds) can predict brain activity in different brain areas (STG, superior temporal gyrus, involved in hearing; and SM, sensorimotor cortex, involved in movement and sensation).
  • Comparison of production-trained and comprehension-trained models: The key idea is to compare two scenarios: (1) predicting brain activity using a model trained on data from when people are talking (production, red line), and (2) predicting brain activity using a model trained on data from when people are listening (comprehension, green line).
  • Line graphs showing correlation over time: The line graphs show the correlation between predicted and actual brain activity over time (the x-axis is 'Lag (s)', meaning time in seconds relative to when a word is spoken).
  • Key finding: second peak related to self-monitoring: The main finding is that the second peak of activity (after the word is spoken) is better predicted by the model trained on production data. This suggests that this second peak is related to the speaker processing their own voice, rather than just general speech processing.
Scientific Validity
  • Overall methodological approach: The figure provides supporting evidence for the interpretation of the double peak in activity observed during speech production. By comparing the encoding performance of models trained on production and comprehension data, the researchers can make inferences about the functional roles of the two peaks.
  • Comparison of different brain regions: The use of separate analyses for STG and SM allows for a comparison of the effects across different brain regions.
  • Focus on electrodes with double peak: The focus on electrodes showing a double peak is justified, as these electrodes are most likely to be involved in both motor planning and auditory feedback processing.
  • Correlational evidence: The figure provides correlational evidence, and further experiments would be needed to definitively establish a causal link between the second peak and self-monitoring.
Communication
  • Overall organization and clarity: The figure presents encoding performance for speech embeddings during speech production, focusing on electrodes showing a double peak in activity (one before and one after word onset). Separate plots are shown for different brain regions (STG, SM) and for two conditions: training the encoding model on production data and training it on comprehension data. The use of line graphs is appropriate for visualizing the temporal dynamics. However, the figure is dense, and the y-axis scale (Correlation (r)) is small and lacks tick marks, making it difficult to compare correlation values across plots. The color-coding (red for production training, green for comprehension training) is consistent, but a more explicit legend would be beneficial.
  • Caption descriptiveness: The caption is informative but could be more precise. It would be helpful to explicitly state the key finding (that the second peak is more strongly predicted by the production-trained model).
  • Readability of line graphs: The x-axis label ('Lag (s)') is consistent, but adding tick marks or grid lines to the plots would improve readability.
  • Brain map clarity: The brain map showing electrode locations is helpful, but it lacks detailed anatomical labels and could be improved by indicating which electrodes show a double peak.
Supp. Table 1. Patient demographics and clinical characteristics.
Figure/Table Image (Page 31)
Supp. Table 1. Patient demographics and clinical characteristics.
First Reference in Text
We collected continuous 24/7 recordings of ECoG and speech signals from 4 patients as they spontaneously conversed with their family, friends, doctors and hospital staff during their entire days-long stay at the epilepsy unit (for patient demographics and clinical characteristics, see Supplementary Table 1).
Description
  • Overall purpose of the table: This supplementary table provides background information about the four people who participated in the study. It includes details about their age, sex, the type of electrodes implanted in their brains, how long their brain activity and speech were recorded, and other relevant medical information.
  • Demographic information (Age and Sex): The 'Age' column shows the age of each participant, ranging from 24 to 53 years. The 'Sex' column indicates whether each participant is male (M) or female (F).
  • Number of electrodes implanted: The 'Number of electrodes implanted' column shows how many electrodes were used to record brain activity for each participant. This number varies considerably across participants (104 to 255).
  • Recording duration and number of words: The 'Hours of speech recorded' column shows how many hours of speech were recorded for each participant, ranging from 17 to 37 hours. The 'Number of words' columns show the total number of words recorded, as well as the number of words spoken by the participant (production) and the number of words spoken by others (comprehension).
  • Neuropsychological testing scores: The 'Neuropsychological testing scores' section presents scores on various cognitive tests, including VCI (Verbal Comprehension Index), POI (Perceptual Organization Index), PSI (Processing Speed Index), and WMI (Working Memory Index). These scores provide information about the participants' cognitive abilities.
  • Clinical characteristics (Pathology/epilepsy type/seizure focus): The 'Pathology/epilepsy type/seizure focus' section describes the medical condition of each participant, including the type of epilepsy and the location of seizure onset. All participants had epilepsy that was resistant to medication.
  • Implant type: The 'Implant' section specifies the type of electrodes used for each participant (grid, strips, and/or depth electrodes).
Scientific Validity
  • Relevance to scientific interpretation: The table provides essential information for understanding the characteristics of the study participants. This information is crucial for assessing the generalizability of the findings and for interpreting the neural data in the context of individual differences.
  • Comprehensive information: The inclusion of both demographic and clinical information is important, as both factors can influence neural activity and language processing.
  • Sample size: The sample size is small (N=4), which is a limitation of the study. However, the extensive amount of data collected from each participant (hours of continuous recordings) partially compensates for the small sample size.
  • Selection bias: It is important to consider potential biases introduced by the selection of participants. All participants had drug-resistant epilepsy and were undergoing intracranial monitoring, which may limit the generalizability of the findings to the broader population.
Communication
  • Overall organization and clarity: The table presents key demographic and clinical information about the four participants in the study. The use of a table is appropriate for summarizing this type of data. The table is well-organized, with clear column headings and row labels. However, it could be improved by adding units to the numerical values (e.g., 'years' for age, 'hours' for recording duration). The Neuropsychological testing scores are presented without stating the maximum possible scores, making it difficult to contextualize.
  • Caption descriptiveness: The caption is concise and informative.
  • Inclusion of relevant information: The inclusion of information about implant type and pathology/epilepsy type/seizure focus provides important context for interpreting the neural data.

Discussion

Key Aspects

Strengths

Suggestions for Improvement

Methods

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Supp. Figure 1. Summary statistics of conversations (A) Distribution of...
Full Caption

Supp. Figure 1. Summary statistics of conversations (A) Distribution of temporal word duration.

Figure/Table Image (Page 21)
Supp. Figure 1. Summary statistics of conversations (A) Distribution of temporal word duration.
First Reference in Text
Not explicitly referenced in main text
Description
  • Overall description: This figure provides information about the words used in the conversations that were recorded and analyzed. It shows two histograms, which are bar-graph-like plots showing how frequently different values occur in a dataset.
  • Panel (A): Distribution of temporal word duration: Panel (A) shows the distribution of word durations, measured in milliseconds (ms). A millisecond is one-thousandth of a second. The x-axis (horizontal axis) shows the duration of the words, ranging from 0 to 2000 ms (2 seconds). The y-axis (vertical axis) shows the frequency, meaning how many words had that particular duration. The graph shows that most words are relatively short, with a peak around a few hundred milliseconds. There are fewer very long words.
  • Panel (B): Distribution of number of characters in words: Panel (B) shows the distribution of word lengths, measured in the number of characters. The x-axis shows the number of characters, ranging from 0 to 50. The y-axis shows the frequency, meaning how many words had that particular number of characters. The graph shows that most words have a relatively small number of characters, with a peak around 5-10 characters. There are fewer very long words.
  • Shape of distribution: Both distributions are skewed to the right, which is expected: long words are much less common than short words.
Scientific Validity
  • Overall methodological approach: The figure provides basic descriptive statistics of the conversational data, which is important for characterizing the dataset. The use of histograms is appropriate for visualizing the distributions of word duration and length.
  • Lack of summary statistics: The figure lacks any statistical analysis beyond the basic distributions. It would be helpful to include summary statistics such as the mean, median, and standard deviation of word duration and length.
  • Accuracy of word segmentation and alignment: It's important to ensure that the word segmentation and alignment were accurate, as errors in these processes could affect the distributions shown in the figure.
Communication
  • Overall organization and clarity: The figure presents two histograms: (A) showing the distribution of word durations in milliseconds, and (B) showing the distribution of word lengths in characters. The use of histograms is appropriate for visualizing the distribution of continuous and discrete variables, respectively. However, the y-axis label 'frequency' is not very informative. It would be better to specify 'Number of words' or 'Frequency (number of words)'. The x-axis labels are clear, but adding tick marks or grid lines might improve readability.
  • Caption descriptiveness: The caption is informative, but could be more specific. It would be helpful to state the total number of words analyzed.
  • Lack of explicit reference in main text: Since the figure is not explicitly referenced in the main text, it may be less critical to the paper's main findings; additional context in the supplementary text would help orient readers.
Supp. Figure 6. Mixed selectivity for speech and language embeddings during...
Full Caption

Supp. Figure 6. Mixed selectivity for speech and language embeddings during speech production and comprehension.

Figure/Table Image (Page 24)
Supp. Figure 6. Mixed selectivity for speech and language embeddings during speech production and comprehension.
First Reference in Text
Not explicitly referenced in main text
Description
  • Individual subject results: This supplementary figure expands on Figure 3 by showing the results for individual participants, rather than averaging across all participants. It shows how well speech sounds (speech embeddings) and the meaning of words (language embeddings) predict brain activity in different brain areas for each person, separately for when they are talking (production) and when they are listening (comprehension).
  • Line graphs showing correlation over time: Each small graph shows the correlation between predicted and actual brain activity over time (the x-axis is 'Lag (s)', meaning time in seconds relative to when a word is spoken or heard). The red line represents speech embeddings (sounds), and the blue line represents language embeddings (meaning).
  • Separate graphs for different brain regions and participants: There are separate graphs for different brain regions (IFG, SM, STG) and for each of the four participants (S1, S2, S3, S4). This allows us to see if the patterns of brain activity are consistent across different people and different brain areas.
  • Statistical threshold: The dashed horizontal line in each graph represents a statistical threshold. Correlations above this line are considered statistically significant, meaning they're unlikely to have happened by chance.
Scientific Validity
  • Individual subject analysis: The figure provides valuable information about the inter-individual variability in encoding performance. By presenting results for each participant separately, the researchers can assess the consistency of the findings across individuals.
  • Consistency with main analyses: The use of separate analyses for production and comprehension, and for speech and language embeddings, is consistent with the main analyses in the paper.
  • Complementary to main figure: The figure complements Figure 3 by providing a more detailed view of the data at the individual subject level. However, it doesn't introduce any new methodological approaches.
Communication
  • Overall organization and clarity: The figure presents line graphs showing encoding performance (correlation values) for speech and language embeddings during both production and comprehension. Separate graphs are shown for different brain regions and for each of the four participants (S1-S4). The use of individual subject plots allows for an assessment of inter-individual variability. However, the figure is very dense, and the small size of the individual plots makes it difficult to discern details. The x-axis (Lag (s)) is consistent, but the y-axis (Correlation (r)) lacks a clear scale and tick marks, making it hard to compare correlation values across plots. The dashed horizontal line representing the statistical threshold is not explicitly defined in the legend.
  • Caption descriptiveness: The caption is informative but could be more specific. It would be beneficial to restate the definition of mixed selectivity and to highlight the key findings observed in the individual subject plots.
  • Legend completeness: The legend could be improved by explicitly defining the dashed horizontal line and by providing a more detailed explanation of the color-coding (red for speech embeddings, blue for language embeddings).
Supp. Figure 7. Average speech and language encoding across ROIs.
Figure/Table Image (Page 26)
Supp. Figure 7. Average speech and language encoding across ROIs.
First Reference in Text
Not explicitly referenced in main text
Description
  • Average encoding performance across ROIs: This supplementary figure shows the average results across all participants for how well speech sounds (speech embeddings) and the meaning of words (language embeddings) predict brain activity in three key brain areas: IFG (inferior frontal gyrus, involved in language), SM (sensorimotor cortex, involved in movement and sensation), and STG (superior temporal gyrus, involved in hearing). It averages the results shown in Supplementary Figure 6.
  • Line graphs showing correlation over time: The line graphs show the correlation between predicted and actual brain activity over time (the x-axis is 'Lag (s)', meaning time in seconds relative to when a word is spoken or heard). The red line represents speech embeddings (sounds), and the blue line represents language embeddings (meaning).
  • Separate graphs for production and comprehension: There are separate graphs for speech production (when people are talking) and speech comprehension (when people are listening). This allows us to see if the patterns of brain activity are different for these two processes.
  • Averaged across participants: The 'N' values indicate the number of electrodes included in each average. These are averaged results across all participants, unlike Supplementary Fig. 6, which shows individual participant results.
Scientific Validity
  • Averaged results across participants: The figure provides a useful summary of the encoding performance across different ROIs, complementing the individual subject results presented in Supplementary Figure 6. Averaging across participants can reveal general trends, but it can also mask inter-individual variability.
  • Consistency with main analyses: The use of separate analyses for production and comprehension, and for speech and language embeddings, is consistent with the main analyses in the paper.
  • Focus on average performance: The figure focuses on the average encoding performance within each ROI. It's important to remember that there may be significant variation in encoding performance within each ROI, as shown in previous figures.
Communication
  • Overall organization and clarity: The figure presents line graphs showing the average encoding performance (correlation values) for speech and language embeddings, averaged across electrodes within specific regions of interest (ROIs): IFG, SM, and STG. Separate graphs are shown for production and comprehension. The use of line graphs is appropriate for visualizing the temporal dynamics. The color-coding (red for speech, blue for language) is consistent with previous figures. However, the figure is quite dense, and the y-axis scale (Correlation (r)) is small and lacks tick marks, making it difficult to compare correlation values across plots. The N values, indicating the number of electrodes, are clearly presented.
  • Caption descriptiveness: The caption is informative, but it could be more precise. It would be helpful to explicitly mention that the results are averaged across participants and to restate the definitions of the ROIs (IFG, SM, STG).
  • Readability of line graphs: The x-axis label ('Lag (s)') is consistent, but adding tick marks or grid lines to the plots would improve readability.
Supp. Figure 9. Representations of phonetic and lexical information in Whisper.
Figure/Table Image (Page 28)
Supp. Figure 9. Representations of phonetic and lexical information in Whisper.
First Reference in Text
Not explicitly referenced in main text
Description
  • Overall purpose of the figure: This supplementary figure explores how well different layers of the Whisper model capture information about the sounds of speech (phonetics) and the meaning of words (lexical information). It does this in two main ways: using t-SNE plots to visualize the data and using classification accuracy to quantify how well the model can predict different categories.
  • t-SNE visualizations (a-d): Panels (a-d) show t-SNE plots. t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique to visualize high-dimensional data in a two-dimensional space. It tries to keep similar data points close together and dissimilar data points far apart. Each point in these plots represents either a short segment of audio (for speech embeddings) or a single word (for language embeddings).
  • Comparison of speech and language embeddings, phonetic and lexical categories: Panels (a) and (b) show speech embeddings, while panels (c) and (d) show language embeddings. Within each pair, one plot is colored by phonetic categories (like manner of articulation or place of articulation), and the other is colored by lexical categories (like part of speech).
  • Classification accuracy (e): Panel (e) shows classification accuracy. This measures how well a computer algorithm can predict the correct phonetic or lexical category of a word or sound segment based on its embedding from different layers of the Whisper model. Higher accuracy means the embedding contains more information about that category.
  • Different layers of the Whisper model: The different colors in panel (e) represent different layers of the Whisper model (speech 0-4, language 0-4). This allows us to see how the representation of phonetic and lexical information changes across different layers of the model.
Scientific Validity
  • Overall methodological approach: The figure provides a valuable extension of the main analyses, exploring the internal representations of the Whisper model in more detail. The use of both t-SNE visualization and classification accuracy provides a comprehensive assessment of phonetic and lexical information.
  • Comparison of different layers: The comparison of different layers of the Whisper model allows for an investigation of how the representation of phonetic and lexical information changes across the processing hierarchy.
  • Multiple phonetic and lexical categories: The use of multiple phonetic categories (phoneme, PoA, MoA) and a lexical category (PoS) provides a more fine-grained analysis of the information captured by the embeddings.
  • Quantitative and qualitative analysis: The classification analysis provides a quantitative measure of the information content of the embeddings, complementing the qualitative visualization provided by the t-SNE plots.
Communication
  • Overall organization and clarity: The figure presents a comprehensive analysis of phonetic and lexical information representation across different layers of the Whisper model, using both t-SNE visualizations (a-d) and classification accuracy plots (e). The organization into subpanels is logical, with (a-d) focusing on t-SNE and (e) on classification. However, the t-SNE plots are quite small and densely packed, making it difficult to discern individual data points and their relationships. The color-coding in the t-SNE plots is not clearly explained in the legend. The classification accuracy plots (e) lack a clear indication of chance level performance, making it difficult to assess the significance of the obtained accuracies.
  • Caption descriptiveness: The caption is informative but could be more specific. It would be helpful to mention the use of t-SNE and classification accuracy as the main analysis methods.
  • Missing axis labels in t-SNE plots: The t-SNE plots (a-d) lack axis labels, which is a significant omission. While t-SNE axes are not directly interpretable in the same way as traditional coordinate axes, it's still important to indicate that these are t-SNE dimensions (e.g., 't-SNE 1', 't-SNE 2').
  • Missing chance level in classification accuracy plots: The classification accuracy plots (e) should include horizontal lines indicating chance-level performance for each category (phoneme, PoA, MoA, PoS). This would provide a crucial baseline for evaluating the significance of the reported accuracies.
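One standard way to obtain the chance-level baseline suggested above is a majority-class "dummy" classifier evaluated under the same cross-validation scheme as the real decoder. The sketch below uses synthetic data and a logistic-regression decoder purely for illustration; the paper's actual classifier and data differ.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))    # hypothetical layer embeddings
y = rng.integers(0, 4, size=200)  # hypothetical 4-class labels (e.g. PoA)

# Majority-class baseline: the horizontal "chance" line the review asks for.
chance = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
# Decoder accuracy, evaluated identically so the two numbers are comparable.
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print(f"chance={chance:.2f}, classifier={acc:.2f}")
```

Because the labels here are random, the decoder should hover near the baseline; an informative embedding would separate the two numbers.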
Supp. Table 2. Distribution of part of speech for all words in our dataset
Figure/Table Image (Page 32)
First Reference in Text
Not explicitly referenced in main text
Description
  • Overall purpose of the table: This supplementary table shows how often different types of words, called "parts of speech," appear in the conversations that were recorded. Parts of speech are categories like nouns (names of things), verbs (actions), adjectives (describing words), and so on.
  • Production vs. Comprehension: The table is divided into two main columns: 'Prod Frequency' (production frequency) and 'Comp Frequency' (comprehension frequency). 'Prod Frequency' shows how many times each part of speech was used by the participants when they were talking. 'Comp Frequency' shows how many times each part of speech was used by other people when talking to the participants.
  • Parts of speech categories: Each row in the table represents a different part of speech, such as NOUN (noun), VERB (verb), PRON (pronoun), ADP (adposition, like prepositions and postpositions), ADV (adverb), DET (determiner), ADJ (adjective), CONJ (conjunction), PRT (particle), NUM (numeral), X (other), and . (punctuation).
  • Frequency counts: The numbers in the table show the raw counts of how many times each part of speech appeared in each category (production or comprehension). For example, the number 64835 in the 'Prod Frequency' column and the 'NOUN' row means that the participants used nouns 64835 times when they were talking.
Scientific Validity
  • Relevance to scientific interpretation: The table provides useful information about the linguistic characteristics of the dataset. Knowing the distribution of parts of speech can be important for understanding the nature of the conversations and for interpreting the neural data in the context of different types of linguistic content.
  • Raw frequencies vs. percentages: The table presents raw frequency counts. It would be helpful to also include percentages or relative frequencies to facilitate comparison across different parts of speech and between production and comprehension.
  • Accuracy of part-of-speech tagging: The accuracy of the part-of-speech tagging is crucial for the validity of the table. The researchers should describe the method used for part-of-speech tagging and report its accuracy.
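The percentage suggestion above is straightforward to implement: divide each raw count by the column total. In the sketch below, only the NOUN production count (64,835) is taken from the table; the other counts are hypothetical placeholders.

```python
from collections import Counter

# (tag, count) data echoing the 'Prod Frequency' column; only NOUN is from the table.
prod_counts = Counter({"NOUN": 64835, "VERB": 48000, "PRON": 40000, "ADP": 30000})

total = sum(prod_counts.values())
# Relative frequencies make parts of speech comparable across the
# production and comprehension columns despite different totals.
prod_pct = {tag: 100 * n / total for tag, n in prod_counts.items()}
print(f"total={total}, NOUN={prod_pct['NOUN']:.1f}%")
```

The same computation applied to the comprehension column would let readers compare the two distributions directly.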
Communication
  • Overall organization and clarity: The table presents the frequency of different parts of speech (e.g., NOUN, VERB, PRON, etc.) in the dataset, separated into production (words spoken by the participants) and comprehension (words spoken to the participants). The use of a table is appropriate for this type of data. The table is well-organized, with clear column headings and row labels. However, it would be helpful to include the total number of words in each category (production and comprehension) and to present percentages in addition to raw frequencies.
  • Caption descriptiveness: The caption is concise and informative.
  • Use of abbreviations: The abbreviations used for parts of speech (e.g., ADP, ADV, DET) are relatively standard, but it would be beneficial to provide a key or expand these abbreviations in the table legend for clarity.
Supp. Table 3. Symbolic speech and linguistic features.
Figure/Table Image (Page 33)
First Reference in Text
Not explicitly referenced in main text
Description
  • Overall purpose of the table: This supplementary table lists the traditional linguistic features that the researchers used as a comparison to the Whisper model's embeddings. These features represent speech and language in a symbolic way, using categories and labels, rather than the continuous, high-dimensional representations of the deep learning model.
  • Division into speech and linguistic features: The table is divided into two main sections: 'Symbolic Speech Features' and 'Symbolic Linguistic Features'. Symbolic speech features are related to the sounds of speech, while symbolic linguistic features are related to the meaning and structure of words and sentences.
  • Symbolic Speech Features: The 'Symbolic Speech Features' section includes: Phonemes (the individual sounds in a word), Place of Articulation (where in the mouth the sound is made), Manner of Articulation (how the sound is made), and Voice or Voiceless (whether the vocal cords vibrate or not). The 'Feature Categories / Dimensions' column shows the number of different categories or values for each feature. For example, there are 39 different phoneme categories.
  • Symbolic Linguistic Features: The 'Symbolic Linguistic Features' section includes: Part of Speech (noun, verb, adjective, etc.), Dependency (how words relate to each other in a sentence), Prefix (a group of letters at the beginning of a word), Suffix (a group of letters at the end of a word), and Stop Word (common words like 'the', 'a', 'is', etc. that are often removed in natural language processing).
  • Sum of dimensions: The 'Sum' row indicates the total number of dimensions for each section (60 for speech features and 137 for linguistic features). These totals reflect the number of binary dimensions used to represent each word under a one-hot encoding scheme.
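A minimal sketch of that one-hot scheme: each word's symbolic speech vector is the concatenation of one binary block per feature, with a single 1 marking the word's category in each block. Only the 39-phoneme inventory and the 60-dimension total come from the table; the split across the remaining features below is assumed for illustration.

```python
import numpy as np

# Category inventory sizes; 39 phonemes and the total of 60 are from the table,
# the remaining split (9 + 10 + 2) is an assumed illustration.
features = {
    "phoneme": 39,
    "place_of_articulation": 9,
    "manner_of_articulation": 10,
    "voicing": 2,
}

def one_hot(index: int, size: int) -> np.ndarray:
    """Return a binary vector of length `size` with a single 1 at `index`."""
    vec = np.zeros(size, dtype=int)
    vec[index] = 1
    return vec

# A word's symbolic speech vector: one one-hot block per feature, concatenated.
word_vector = np.concatenate([one_hot(0, size) for size in features.values()])
print(word_vector.shape[0], int(word_vector.sum()))
```

The vector length equals the 'Sum' row of the table, and exactly one dimension per feature is active for any given word.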
Scientific Validity
  • Transparency and reproducibility: The table provides a clear and comprehensive list of the symbolic features used in the study. This is important for transparency and reproducibility, allowing other researchers to understand and potentially replicate the analysis.
  • Rationale for feature selection: The choice of symbolic features appears to be reasonable and covers a range of relevant aspects of speech and language. However, the rationale for selecting these specific features (and not others) is not explicitly stated.
  • Missing methodological details: The table lists the features used, but it doesn't describe the method used to extract or annotate these features (e.g., how phonemes were determined, how part-of-speech tagging was performed). This information should be provided in the Methods section.
Communication
  • Overall organization and clarity: The table lists the symbolic speech and linguistic features used in the study, along with the number of categories or dimensions for each feature. The table is well-organized, with clear row and column labels. The separation into 'Symbolic Speech Features' and 'Symbolic Linguistic Features' is logical. However, it would be beneficial to provide a brief explanation or definition of each feature within the table, or in a separate supplementary note, to make it more accessible to readers who may not be familiar with all the terms.
  • Caption descriptiveness: The caption is concise and informative.
  • Use of abbreviations: The use of abbreviations (e.g., PoA, MoA) could be expanded upon in the table legend or in a footnote for better clarity.